Medical Dataset - Segmenting Patients¶

InĀ [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
InĀ [2]:
df=pd.read_csv('patient_dataset.csv', index_col=0)
df.head()
Out[2]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease Residence_type smoking_status triage
0 40.0 1.0 2.0 140.0 294.0 172.0 0.0 108.0 43.0 92.0 19.0 0.467386 0.0 0.0 Urban never smoked yellow
1 49.0 0.0 3.0 160.0 180.0 156.0 0.0 75.0 47.0 90.0 18.0 0.467386 0.0 0.0 Urban never smoked orange
2 37.0 1.0 2.0 130.0 294.0 156.0 0.0 98.0 53.0 102.0 23.0 0.467386 0.0 0.0 Urban never smoked yellow
3 48.0 0.0 4.0 138.0 214.0 156.0 1.0 72.0 51.0 118.0 18.0 0.467386 0.0 0.0 Urban never smoked orange
4 54.0 1.0 3.0 150.0 195.0 156.0 0.0 108.0 90.0 83.0 21.0 0.467386 0.0 0.0 Urban never smoked yellow

Defining Problem Statement and perform Exploratory Data Analysis¶

Triage refers to the sorting of injured or sick people according to their need for emergency medical attention. It is a method of determining priority for who gets care first.

As the dataset is seen

Based on patient symptoms,

    Identify patients needing immediate resuscitation; 
    To assign patients to a predesignated patient care area, 
    Thereby prioritizing their care; 
    And to initiate diagnostic/therapeutic measures as appropriate 
  • The dataset includes demographic, lifestyle, and health-related features, such as age, gender, cholesterol levels, blood pressure, BMI, diabetes history, and smoking status.

  • Apply unsupervised learning techniques such as K-Means, Gaussian Mixture Model and Hierarchical Clustering to segment the data into meaningful clusters.

      The study will explore whether these clusters reveal distinct patient groups that could be useful for medical research, risk stratification, or personalized treatment plans. 

Observations on the data types of all the attributes¶

InĀ [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6962 entries, 0 to 5109
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                6962 non-null   float64
 1   gender             6961 non-null   float64
 2   chest pain type    6962 non-null   float64
 3   blood pressure     6962 non-null   float64
 4   cholesterol        6962 non-null   float64
 5   max heart rate     6962 non-null   float64
 6   exercise angina    6962 non-null   float64
 7   plasma glucose     6962 non-null   float64
 8   skin_thickness     6962 non-null   float64
 9   insulin            6962 non-null   float64
 10  bmi                6962 non-null   float64
 11  diabetes_pedigree  6962 non-null   float64
 12  hypertension       6962 non-null   float64
 13  heart_disease      6962 non-null   float64
 14  Residence_type     6962 non-null   object 
 15  smoking_status     6962 non-null   object 
 16  triage             6552 non-null   object 
dtypes: float64(14), object(3)
memory usage: 979.0+ KB

Missing value check¶

InĀ [4]:
print('Missing Values in the dataset ')
df.isna().sum()
Missing Values in the dataset 
Out[4]:
age                    0
gender                 1
chest pain type        0
blood pressure         0
cholesterol            0
max heart rate         0
exercise angina        0
plasma glucose         0
skin_thickness         0
insulin                0
bmi                    0
diabetes_pedigree      0
hypertension           0
heart_disease          0
Residence_type         0
smoking_status         0
triage               410
dtype: int64
InĀ [5]:
print("Total Missing Values ")
df.isna().sum().sum()
Total Missing Values 
Out[5]:
411

Outlier detection¶

InĀ [6]:
# Extracting Numerical data from the pool
numeric_data=df.select_dtypes('number')
numeric_data.head(5)
Out[6]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease
0 40.0 1.0 2.0 140.0 294.0 172.0 0.0 108.0 43.0 92.0 19.0 0.467386 0.0 0.0
1 49.0 0.0 3.0 160.0 180.0 156.0 0.0 75.0 47.0 90.0 18.0 0.467386 0.0 0.0
2 37.0 1.0 2.0 130.0 294.0 156.0 0.0 98.0 53.0 102.0 23.0 0.467386 0.0 0.0
3 48.0 0.0 4.0 138.0 214.0 156.0 1.0 72.0 51.0 118.0 18.0 0.467386 0.0 0.0
4 54.0 1.0 3.0 150.0 195.0 156.0 0.0 108.0 90.0 83.0 21.0 0.467386 0.0 0.0
InĀ [7]:
plt.figure(figsize=(15, 15), layout="constrained", frameon=True)
i=1
for col in numeric_data.columns:
    plt.subplot(4, 4, i)
    sns.boxplot(df[col], color="#e63946")
    plt.title(col)
    i += 1
plt.show()
No description has been provided for this image
  • As from the above Graph we see the potential Features for Outliers can be :
    • cholesterol
    • plasma glucose
    • insulin
    • bmi
    • diabetes_pedigree
InĀ [8]:
outlier_features=df[['cholesterol', 'plasma glucose', 'insulin', 'bmi', 'diabetes_pedigree']]
InĀ [9]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True, color="#2a9d8f")
    plt.title(col)
    i += 1
plt.show()
No description has been provided for this image

Relationship between important variables¶

InĀ [10]:
plt.figure(figsize=(15, 6))
sns.lineplot(
    x=df["hypertension"],
    y=df["max heart rate"],
    hue=df["triage"],
    errorbar=None,
    hue_order=["yellow", "orange", "green", "red"],
)
plt.title("Max Heart Rate vs. Hypertension")
plt.show()
No description has been provided for this image
InĀ [11]:
plt.figure(figsize=(15, 6))
sns.barplot(
    x=df["exercise angina"],
    y=df["max heart rate"],
    hue=df["triage"],
    hue_order=["yellow", "orange", "green", "red"],
)
plt.title("Max Heart Rate vs. Exercise Angina")
plt.show()
No description has been provided for this image
InĀ [12]:
plt.figure(figsize=(15, 6))
sns.violinplot(
    x=df["heart_disease"], y=df["age"], palette="coolwarm", hue=df["heart_disease"]
)
plt.title("Age vs. Heart Disease")
plt.show()
No description has been provided for this image
InĀ [13]:
plt.figure(figsize=(15, 8))
sns.lineplot(y=df["bmi"], x=df["age"], hue=df["smoking_status"])
plt.show()
No description has been provided for this image
InĀ [14]:
plt.figure(figsize=(15, 8))
sns.barplot(x=df["chest pain type"], y=df["age"], hue=df["smoking_status"])
plt.title("Chest Pain Type vs. Age on Smoking Status")
plt.show()
No description has been provided for this image
InĀ [15]:
plt.figure(figsize=(20,12))

plt.subplot(2,2,1)
sns.histplot(df["age"], kde=True, color="#cdb4db")
plt.title("Age Distribution")

plt.subplot(2, 2, 2)
sns.histplot(df["cholesterol"], kde=True, color="#219ebc")
plt.title("Cholesterol Distribution")

plt.subplot(2, 2, 3)
sns.histplot(df["max heart rate"], kde=True, color="#9b5de5")
plt.title("Max Heart Rate Distribution")


plt.subplot(2, 2, 4)
sns.histplot(df["blood pressure"], kde=True, color="#3a5a40")
plt.title("Blood Pressure Distribution")

plt.show()
No description has been provided for this image



Data Preprocessing¶

Imputation¶

InĀ [16]:
# Handling Missing values 
missing_values_data=df.isna().sum()[df.isna().sum()>0]

sns.heatmap(df.isnull(), cbar=False, cmap="viridis", yticklabels=False)
plt.title("Heatmap of Missing Values")
plt.show()
No description has been provided for this image
InĀ [17]:
#Calculating the percentage of missing values
missing_percentage = round((missing_values_data / len(df)) * 100, 2)

missing_data_summary = pd.DataFrame(
    {
        "Missing Values": missing_values_data[missing_values_data > 0],
        "Percentage (%)": missing_percentage[missing_values_data > 0],
    }
).sort_values(by="Percentage (%)", ascending=False)

print(missing_data_summary)
        Missing Values  Percentage (%)
triage             410            5.89
gender               1            0.01

InĀ [18]:
# Handling triage
# df['triage'].value_counts()
InĀ [19]:
# Null values filled with mode
# df["triage"] = df["triage"].fillna("yellow")

InĀ [20]:
df["gender"].value_counts()
Out[20]:
gender
1.0    3703
0.0    3258
Name: count, dtype: int64
InĀ [21]:
df["gender"] = df["gender"].fillna(0.0)
InĀ [22]:
df.isnull().sum()
Out[22]:
age                    0
gender                 0
chest pain type        0
blood pressure         0
cholesterol            0
max heart rate         0
exercise angina        0
plasma glucose         0
skin_thickness         0
insulin                0
bmi                    0
diabetes_pedigree      0
hypertension           0
heart_disease          0
Residence_type         0
smoking_status         0
triage               410
dtype: int64


Outlier Treatment¶

InĀ [23]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True, color="#2a9d8f")
    plt.title(col)
    i += 1
plt.show()
No description has been provided for this image

Transform Data to Reduce Impact

  • Log Transformation (Best for right-skewed data)
InĀ [24]:
outlier_features.columns
Out[24]:
Index(['cholesterol', 'plasma glucose', 'insulin', 'bmi', 'diabetes_pedigree'], dtype='object')
InĀ [25]:
# cholesterol
df["cholesterol"] = np.log1p(df["cholesterol"])

InĀ [26]:
# plasma glucose
df["plasma glucose"] = np.log1p(df["plasma glucose"])

InĀ [27]:
# insulin
df["insulin"] = np.log1p(df["insulin"])

InĀ [28]:
# bmi
df["bmi"] = np.log1p(df["bmi"])

InĀ [29]:
# diabetes_pedigree
df["diabetes_pedigree"] = np.log1p(df["diabetes_pedigree"])

InĀ [30]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True, color="#2a9d8f")
    plt.title(col)
    i += 1
plt.show()
No description has been provided for this image


Encoding all the categorical attributes¶

InĀ [31]:
categorical_data=df.select_dtypes("object")
categorical_data.head(5)
Out[31]:
Residence_type smoking_status triage
0 Urban never smoked yellow
1 Urban never smoked orange
2 Urban never smoked yellow
3 Urban never smoked orange
4 Urban never smoked yellow
InĀ [32]:
# Encoding residence_type

df['Residence_type'].value_counts().index
Out[32]:
Index(['Urban', 'Rural'], dtype='object', name='Residence_type')
InĀ [33]:
Residence_type_map = {"Urban": 0, "Rural": 1}

df['Residence_type']=df['Residence_type'].map(Residence_type_map)

df['Residence_type']
Out[33]:
0       0
1       0
2       0
3       0
4       0
       ..
5105    0
5106    0
5107    1
5108    1
5109    0
Name: Residence_type, Length: 6962, dtype: int64

InĀ [34]:
# Encoding smoking_status

df["smoking_status"].value_counts().index
Out[34]:
Index(['never smoked', 'Unknown', 'formerly smoked', 'smokes'], dtype='object', name='smoking_status')
InĀ [35]:
smoking_status = {"never smoked":0, "Unknown":2, "formerly smoked":0.5, "smokes":1}

df["smoking_status"] = df["smoking_status"].map(smoking_status)

df["smoking_status"]
Out[35]:
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
5105    0.0
5106    0.0
5107    0.0
5108    0.5
5109    2.0
Name: smoking_status, Length: 6962, dtype: float64

InĀ [36]:
# triage
df["triage"].value_counts().index
Out[36]:
Index(['yellow', 'green', 'orange', 'red'], dtype='object', name='triage')
InĀ [37]:
# triage_map = {"yellow": 0, "orange": 1, "green": 2, "red": 3}

# df['triage'] = df['triage'].map(triage_map)

# df['triage']


Standardization¶

InĀ [88]:
X = df.drop("triage", axis=1)
y = df["triage"]
InĀ [89]:
from sklearn.preprocessing import StandardScaler

scaler= StandardScaler()

X[X.columns] = scaler.fit_transform(X)
X
Out[89]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease Residence_type smoking_status
0 -1.465884 0.938135 1.173314 1.410374 3.129483 0.549734 -0.256573 0.485469 -0.603531 -1.109043 -1.256305 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
1 -0.709841 -1.065945 1.970952 2.339167 -0.086923 -0.485357 -0.256573 -0.870489 -0.428764 -1.247255 -1.463029 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
2 -1.717898 0.938135 1.173314 0.945977 3.129483 -0.485357 -0.256573 0.123639 -0.166614 -0.459755 -0.521504 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
3 -0.793846 -1.065945 2.768591 1.317494 1.046546 -0.485357 3.897525 -1.021924 -0.253998 0.458231 -1.463029 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
4 -0.289817 0.938135 1.970952 1.874771 0.437322 -0.485357 -0.256573 0.485469 1.449976 -1.756125 -0.872181 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5105 1.894305 -1.065945 -0.421962 0.063623 -1.150619 0.161575 -0.256573 -0.460738 -1.127831 -0.099801 -1.337727 0.033263 3.602766 -0.202792 -0.751562 -0.771064
5106 1.978310 -1.065945 -0.421962 0.620899 -0.981776 -0.226584 -0.256573 1.036404 -1.477364 -1.317504 1.636766 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
5107 1.978310 -1.065945 -0.421962 0.806658 0.092503 -1.455754 -0.256573 -0.494609 -0.690914 -0.907201 0.587230 0.033263 -0.277565 -0.202792 1.330562 -0.771064
5108 -0.541832 0.938135 -0.421962 0.620899 -0.817154 -0.097198 -0.256573 2.096238 -0.996756 -1.041048 -0.106963 0.033263 -0.277565 -0.202792 1.330562 -0.149607
5109 -1.129865 -1.065945 -0.421962 0.713778 -0.234070 0.549734 -0.256573 -0.393462 0.008152 0.185336 -0.017066 0.033263 -0.277565 -0.202792 -0.751562 1.714764

6962 rows Ɨ 16 columns

InĀ [90]:
y
Out[90]:
0       yellow
1       orange
2       yellow
3       orange
4       yellow
         ...  
5105    yellow
5106    yellow
5107    yellow
5108     green
5109    yellow
Name: triage, Length: 6962, dtype: object
InĀ [41]:
df.to_csv('cleaned_patient_dataset.csv', index=False)

Correlation between all the attributes¶

InĀ [91]:
plt.figure(figsize=(15, 7), layout="constrained")
sns.heatmap(data=X.corr(), annot=True, cmap='Blues')
plt.show()
No description has been provided for this image



Model Training¶

InĀ [2]:
df=pd.read_csv('cleaned_patient_dataset.csv')
df.head()
Out[2]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease Residence_type smoking_status triage
0 40.0 1.0 2.0 140.0 5.686975 172.0 0.0 4.691348 43.0 4.532599 2.995732 0.383483 0.0 0.0 0 0.0 yellow
1 49.0 0.0 3.0 160.0 5.198497 156.0 0.0 4.330733 47.0 4.510860 2.944439 0.383483 0.0 0.0 0 0.0 orange
2 37.0 1.0 2.0 130.0 5.686975 156.0 0.0 4.595120 53.0 4.634729 3.178054 0.383483 0.0 0.0 0 0.0 yellow
3 48.0 0.0 4.0 138.0 5.370638 156.0 1.0 4.290459 51.0 4.779123 2.944439 0.383483 0.0 0.0 0 0.0 orange
4 54.0 1.0 3.0 150.0 5.278115 156.0 0.0 4.691348 90.0 4.430817 3.091042 0.383483 0.0 0.0 0 0.0 yellow
InĀ [3]:
X=df.drop('triage', axis=1)
X
Out[3]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease Residence_type smoking_status
0 40.0 1.0 2.0 140.0 5.686975 172.0 0.0 4.691348 43.0 4.532599 2.995732 0.383483 0.0 0.0 0 0.0
1 49.0 0.0 3.0 160.0 5.198497 156.0 0.0 4.330733 47.0 4.510860 2.944439 0.383483 0.0 0.0 0 0.0
2 37.0 1.0 2.0 130.0 5.686975 156.0 0.0 4.595120 53.0 4.634729 3.178054 0.383483 0.0 0.0 0 0.0
3 48.0 0.0 4.0 138.0 5.370638 156.0 1.0 4.290459 51.0 4.779123 2.944439 0.383483 0.0 0.0 0 0.0
4 54.0 1.0 3.0 150.0 5.278115 156.0 0.0 4.691348 90.0 4.430817 3.091042 0.383483 0.0 0.0 0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6957 80.0 0.0 0.0 111.0 5.036953 166.0 0.0 4.439706 31.0 4.691348 2.975530 0.383483 1.0 0.0 0 0.0
6958 81.0 0.0 0.0 123.0 5.062595 160.0 0.0 4.837868 23.0 4.499810 3.713572 0.383483 0.0 0.0 0 0.0
6959 81.0 0.0 0.0 127.0 5.225747 141.0 0.0 4.430698 41.0 4.564348 3.453157 0.383483 0.0 0.0 1 0.0
6960 51.0 1.0 0.0 123.0 5.087596 162.0 0.0 5.119729 34.0 4.543295 3.280911 0.383483 0.0 0.0 1 0.5
6961 44.0 0.0 0.0 125.0 5.176150 172.0 0.0 4.457598 57.0 4.736198 3.303217 0.383483 0.0 0.0 0 2.0

6962 rows Ɨ 16 columns

InĀ [4]:
y=df['triage']
y
Out[4]:
0       yellow
1       orange
2       yellow
3       orange
4       yellow
         ...  
6957    yellow
6958    yellow
6959    yellow
6960     green
6961    yellow
Name: triage, Length: 6962, dtype: object

K-Means Clustering¶

InĀ [10]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertia = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(K_range, inertia, marker="o")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")

plt.subplot(1, 2, 2)
plt.plot(K_range, silhouette_scores, marker="o", color="r")
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score Analysis")

plt.show()
No description has been provided for this image

Observations from the Elbow Method¶

  • The inertia drops sharply from k=2 to k=4, but after that, the decrease slows down.
  • The "elbow" (point where the curve starts to flatten) seems to be around k=4 or k=5.

Observations from the Silhouette Score (Right Plot)¶

  • Highest silhouette score at k=2 (0.28), meaning two well-separated clusters exist.
  • The score gradually decreases as k increases, indicating overlapping clusters.
  • After k=6, the silhouette score becomes very low (~0.21), suggesting poor separation.
InĀ [14]:
optimal_k=4
InĀ [12]:
kmeans = KMeans(n_clusters=optimal_k, init="k-means++",random_state=42)
kmeans.fit(X, y)
Out[12]:
KMeans(n_clusters=4, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=4, random_state=42)
InĀ [35]:
# 2D Visualization using TSNE

from sklearn.manifold import TSNE

tsne = TSNE(2 , perplexity=200, n_iter=300)
components_tsne = tsne.fit_transform(X)
InĀ [36]:
Kmeans_data = np.vstack((components_tsne.T, kmeans.labels_)).T
Kmeans_data
Out[36]:
array([[-0.60867536,  4.61441231,  0.        ],
       [ 1.26883852,  7.63253117,  0.        ],
       [ 2.23190188,  6.16152525,  0.        ],
       ...,
       [-0.61595494,  6.51414299,  0.        ],
       [-2.77767849,  5.40629578,  0.        ],
       [ 1.26801634,  1.6698674 ,  0.        ]])
InĀ [37]:
Kmeans_tsne = pd.DataFrame(Kmeans_data, columns=["X1", "X2", "clusters"])
Kmeans_tsne.head(10)
Out[37]:
X1 X2 clusters
0 -0.608675 4.614412 0.0
1 1.268839 7.632531 0.0
2 2.231902 6.161525 0.0
3 1.539832 5.844163 0.0
4 7.277162 2.938786 2.0
5 -0.075848 1.356522 0.0
6 -1.268180 4.740068 0.0
7 7.494108 -1.875750 1.0
8 4.927003 5.426473 2.0
9 -3.845872 6.579289 0.0
InĀ [38]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clustering")

plt.show()
No description has been provided for this image

Gaussian Mixture Model¶

InĀ [18]:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=optimal_k, covariance_type="full")
gmm_model=gmm.fit(X)
InĀ [19]:
GMM_labels = gmm_model.predict(X)
GMM_labels
Out[19]:
array([0, 0, 0, ..., 1, 1, 1], dtype=int64)
InĀ [20]:
from sklearn.metrics import silhouette_score

silhouette_score(X, GMM_labels)
Out[20]:
-0.043183584125432585
InĀ [21]:
GMM_data = np.vstack((components_tsne.T, GMM_labels)).T
GMM_data
Out[21]:
array([[-0.60867536,  4.61441231,  0.        ],
       [ 1.26883852,  7.63253117,  0.        ],
       [ 2.23190188,  6.16152525,  0.        ],
       ...,
       [-0.61595494,  6.51414299,  1.        ],
       [-2.77767849,  5.40629578,  1.        ],
       [ 1.26801634,  1.6698674 ,  1.        ]])
InĀ [22]:
GMM_tsne = pd.DataFrame(GMM_data, columns=["X1", "X2", "clusters"])
GMM_tsne.head(10)
Out[22]:
X1 X2 clusters
0 -0.608675 4.614412 0.0
1 1.268839 7.632531 0.0
2 2.231902 6.161525 0.0
3 1.539832 5.844163 3.0
4 7.277162 2.938786 0.0
5 -0.075848 1.356522 0.0
6 -1.268180 4.740068 0.0
7 7.494108 -1.875750 0.0
8 4.927003 5.426473 3.0
9 -3.845872 6.579289 0.0
InĀ [23]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title('GMM Cluters')
plt.show()
No description has been provided for this image

Hierarchical Clustering¶

InĀ [57]:
from scipy.cluster import hierarchy

Z=hierarchy.linkage(X, method='ward')
InĀ [58]:
Z.shape, Z
Out[58]:
((6961, 4),
 array([[6.22000000e+02, 1.05000000e+03, 3.65905944e-01, 2.00000000e+00],
        [1.36400000e+03, 1.36500000e+03, 3.89680688e-01, 2.00000000e+00],
        [6.41000000e+02, 8.96000000e+02, 4.38727642e-01, 2.00000000e+00],
        ...,
        [1.38980000e+04, 1.39090000e+04, 1.09549943e+02, 7.09000000e+02],
        [1.39190000e+04, 1.39200000e+04, 1.23591245e+02, 5.83200000e+03],
        [1.39180000e+04, 1.39210000e+04, 1.85486960e+02, 6.96200000e+03]]))
InĀ [59]:
plt.figure(figsize=(12, 10))
hierarchy.dendrogram(Z)
plt.title('Dendogram of CLusters')
plt.ylabel('Euclidean Distance')
plt.show()
No description has been provided for this image
InĀ [24]:
optimal_k=4
InĀ [25]:
from sklearn.cluster import AgglomerativeClustering

agg_cluster = AgglomerativeClustering(
    n_clusters=optimal_k, metric="euclidean", linkage="ward"
)
agg_labels = agg_cluster.fit_predict(X)
InĀ [26]:
print(f"Silhouette Score: {silhouette_score(X, agg_labels)}")
Silhouette Score: 0.19928877106753798
InĀ [27]:
np.unique(agg_labels), agg_labels
Out[27]:
(array([0, 1, 2, 3], dtype=int64), array([1, 1, 1, ..., 1, 1, 3], dtype=int64))
InĀ [28]:
Hierarchy_data = np.vstack((components_tsne.T, agg_labels)).T
Hierarchy_data
Out[28]:
array([[-0.60867536,  4.61441231,  1.        ],
       [ 1.26883852,  7.63253117,  1.        ],
       [ 2.23190188,  6.16152525,  1.        ],
       ...,
       [-0.61595494,  6.51414299,  1.        ],
       [-2.77767849,  5.40629578,  1.        ],
       [ 1.26801634,  1.6698674 ,  3.        ]])
InĀ [29]:
hierarchy_tsne = pd.DataFrame(Hierarchy_data, columns=["X1", "X2", "clusters"])
hierarchy_tsne.head(10)
Out[29]:
X1 X2 clusters
0 -0.608675 4.614412 1.0
1 1.268839 7.632531 1.0
2 2.231902 6.161525 1.0
3 1.539832 5.844163 1.0
4 7.277162 2.938786 3.0
5 -0.075848 1.356522 1.0
6 -1.268180 4.740068 1.0
7 7.494108 -1.875750 3.0
8 4.927003 5.426473 1.0
9 -3.845872 6.579289 1.0
InĀ [30]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title('Hierarchical Clusters')
plt.show()
No description has been provided for this image

Compare the clustering results of all the algorithms using Inertia and the Silhouette Score.¶

InĀ [31]:
print(f"Hierarchal Clustering Silhouette Score: {silhouette_score(X, hierarchy_tsne['clusters'])}")
Hierarchal Clustering Silhouette Score: 0.19928877106753798
InĀ [32]:
print(f"GMM Clutering Silhouette Score: {silhouette_score(X, GMM_tsne['clusters'])}")
GMM Clutering Silhouette Score: -0.043183584125432585
InĀ [33]:
print(f"KMeans++ Silhouette Score: {silhouette_score(X, Kmeans_tsne['clusters'])}")
KMeans++ Silhouette Score: 0.2509862709654033

The Silhouette Score of all the Models Turns out to be Good

  • The score is near to 1 which is a good sign
 The Best Result is given by Gausian Mixture Model 

Visualize the clusters formed using T-SNE for all the three algorithms.¶

InĀ [34]:
plt.figure(figsize=(20, 8))

plt.subplot(1,3,1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title("Hierarchical Clusters")

plt.subplot(1, 3, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")

plt.subplot(1, 3, 3)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clusters")
plt.show()
No description has been provided for this image

Expected Insights¶

Identification of distinct patient groups based on health and lifestyle attributes.¶

InĀ [39]:
Orignal_data = np.vstack((components_tsne.T, y)).T
Orignal_tsne = pd.DataFrame(Orignal_data, columns=["X1", "X2", "clusters"])
InĀ [40]:
plt.figure(figsize=(20, 15))

plt.subplot(2, 1, 1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Orignal_tsne, palette="tab10")
plt.title("Orignal Clusters")


plt.subplot(2, 1, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")

plt.show()
No description has been provided for this image

Comparison of clustering algorithms to determine which provides the most meaningful segmentation.¶

InĀ [41]:
plt.figure(figsize=(20, 15))

plt.subplot(2, 2, 1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Orignal_tsne, palette="tab10")
plt.title("Orignal Clusters")


plt.subplot(2, 2, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title("Hierarchical Clusters")

plt.subplot(2, 2, 3)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")

plt.subplot(2, 2, 4)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clusters")
plt.show()
No description has been provided for this image

\

/

\

/

\

/

\

/

\

/

\

/

Apply Clustering Models¶

InĀ [42]:
# K-Means Clustering
optimal_k = 4  # Choose based on Elbow/Silhouette
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
X["KMeans_Cluster"] = kmeans.fit_predict(X)
InĀ [44]:
# Gaussian Mixture Model (GMM)

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=optimal_k, random_state=42)
X["GMM_Cluster"] = gmm.fit_predict(X)
InĀ [45]:
# Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering

agg_clustering = AgglomerativeClustering(n_clusters=optimal_k, linkage="ward")
X["Hierarchical_Cluster"] = agg_clustering.fit_predict(X)

Analyze Cluster Insights¶

InĀ [47]:
X.groupby("GMM_Cluster").mean()
Out[47]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease Residence_type smoking_status KMeans_Cluster Hierarchical_Cluster
GMM_Cluster
0 57.949393 0.497470 0.838057 124.438765 5.243208 161.608300 0.096660 4.548247 39.415992 4.700638 3.292613 0.383005 0.070850 0.043016 0.352733 0.630314 3.0 0.338057
1 54.518430 0.676576 0.045779 84.481570 5.165301 165.567776 0.004162 4.623849 35.964328 4.724797 3.331616 0.374696 0.062426 0.027943 0.265755 0.460166 1.0 1.914982
2 57.939480 0.516403 0.983597 128.473416 5.252657 162.115385 0.119910 4.542947 78.730204 4.695637 3.262908 0.383483 0.066742 0.033371 0.359163 0.576640 0.0 2.336538
3 59.454427 0.435547 0.137370 96.428385 5.174824 165.274089 0.013021 4.534976 76.798177 4.708984 3.351350 0.384269 0.087891 0.054688 0.477865 0.833333 2.0 1.274089

Visualize Clustering Results¶

InĀ [49]:
from sklearn.decomposition import PCA
import seaborn as sns

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(20, 20))

plt.subplot(2,2,1)
sns.scatterplot(
    x=X_pca[:, 0], y=X_pca[:, 1], hue=X["KMeans_Cluster"], palette="viridis"
)
plt.title("KMeans")

plt.subplot(2, 2, 2)
sns.scatterplot(
    x=X_pca[:, 0], y=X_pca[:, 1], hue=X["GMM_Cluster"], palette="viridis"
)
plt.title("GMM")

plt.subplot(2, 2, 3)
sns.scatterplot(
    x=X_pca[:, 0], y=X_pca[:, 1], hue=X["Hierarchical_Cluster"], palette="viridis"
)
plt.title("Hierarchical_Clusters")

plt.subplot(2, 2, 4)
sns.scatterplot(
    x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette="viridis"
)
plt.title("orignal data")
plt.show()
No description has been provided for this image
InĀ [52]:
pd.crosstab(df['triage'], X["KMeans_Cluster"])
Out[52]:
KMeans_Cluster 0 1 2 3
triage
green 75 169 87 109
orange 172 1 8 165
red 24 53 14 38
yellow 1470 1277 1289 1601